ABSTRACT

Recently, sequential model based optimization (SMBO) has emerged as a promising hyper-parameter optimization strategy in machine learning. In this work, we investigate SMBO to identify architecture hyper-parameters of deep convolution networks (DCNs) for object recognition. We propose a simple SMBO strategy that starts from a set of random initial DCN architectures and generates new architectures which, on training, perform well on a given dataset. Using the proposed SMBO strategy, we are able to identify a number of DCN architectures that produce results comparable to state-of-the-art results on object recognition benchmarks.

Index Terms: hyper-parameter optimization, deep convolution networks, sequential model based optimization

1. INTRODUCTION 

The primary task for a supervised machine learning algorithm is to use a training dataset $\{x_{tr}, y_{tr}\}$ to find a function $f : x \to y$ that also generalizes well across the test (or hold-out) dataset $\{x_h, y_h\}$. Very often $f$ is obtained through the optimization of a training criterion, $C$, with respect to a set of parameters, $\theta$. The learning algorithm used to optimize $C$ usually contains its own set of free parameters $\lambda_l$, referred to as the learning algorithm hyper-parameters. These hyper-parameters are often estimated using grid search with cross validation. In addition to the learning algorithm hyper-parameters $\lambda_l$, neural network models such as deep convolution networks (DCNs) also comprise hyper-parameters $\lambda_a \neq \lambda_l$ that define the architectural configuration of the network. Grid search techniques are prohibitively expensive for tuning $\lambda_a$, given that there are a few tens of these architectural hyper-parameters. As a result, many of the state-of-the-art DCNs are manually designed, making the task of tuning these hyper-parameters more of an art than a science [?].


In recent years, there has been a concerted effort in the machine learning community to develop better algorithms to solve the hyper-parameter optimization problem [?]. Many of these works have successfully applied direct search methods for non-linear optimization, such as sequential model based optimization (SMBO), to generate better results on various supervised machine learning tasks than were previously reported. Motivated by these works, in this paper we attempt to address the question: Can SMBO be used to determine superior architectural configurations for DCNs? The paper is organized as follows. In the Methods section (Section 2), we briefly formulate the problem of hyper-parameter optimization for DCNs. We then present the general strategy of SMBO [?] and summarize our approach to SMBO for designing DCN architectures. The results of image classification training and evaluation on the benchmark CIFAR-10 dataset are then presented in the Results section (Section 3), which is followed by the Conclusion.


2. METHODS 

2.1. Formulation of the problem 

Let $M_\lambda(x, w)$ represent the DCN model that operates on input data $x \in \mathbb{R}^D$ and generates an estimate $\hat{y}$ for the output data $y$. The DCN model, $M$, is parameterized by two sets of parameters: the first being $w$, which are obtained through the optimization of a training criterion, $C$, using a gradient descent type learning algorithm such as the back-propagation algorithm, and the second being $\lambda = \{\lambda_a, \lambda_l\}$, which represent the set of so-called hyper-parameters. The hyper-parameters $\lambda_a$ define the DCN architecture and the hyper-parameters $\lambda_l$ are associated with the learning algorithm used to optimize $C$. The objective for DCN hyper-parameter optimization is to solve the joint optimization problem stated below:

$$\{w^*, \lambda^*\} = \underset{\lambda}{\operatorname{argmin}} \left[\Psi(y_h, \hat{y}_h)\right], \quad \text{where} \quad \hat{y}_h = M_\lambda\left(\underset{w}{\operatorname{argmin}} \left(C(x_{tr}, y_{tr}, \theta)\right), x_h\right) \tag{1}$$

where $\{(x_{tr}, y_{tr}), (x_h, y_h)\} \in (x, y)$ are the input and output training and hold-out (or test) data sets respectively, and $\Psi$ is the error evaluated over the hold-out set.
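The nested structure of Eq. 1 (an inner fit of the weights for fixed hyper-parameters, and an outer selection of hyper-parameters by hold-out error) can be illustrated with a deliberately small stand-in. The sketch below is not a DCN: it assumes a 1-D ridge regression whose penalty plays the role of $\lambda$, and the names `train`, `holdout_error` and `select_hyperparameter` are hypothetical helpers introduced only for illustration.

```python
import random

def train(lam, xs, ys):
    """Inner problem: fit the weight w for a fixed hyper-parameter lam.

    A 1-D ridge regression with a closed-form solution stands in for
    gradient-descent training of a DCN (an assumption for illustration).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def holdout_error(w, xs, ys):
    """Outer objective (the role of Psi): mean squared error on hold-out data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def select_hyperparameter(candidates, train_set, holdout_set):
    """Outer problem: keep the lam whose trained model has the lowest hold-out error."""
    best = None
    for lam in candidates:
        w = train(lam, *train_set)
        err = holdout_error(w, *holdout_set)
        if best is None or err < best[1]:
            best = (lam, err, w)
    return best

# Synthetic data: y = 2x + noise, split into training and hold-out parts.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]
train_set = (xs[:150], ys[:150])
holdout_set = (xs[150:], ys[150:])
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam, best_err, best_w = select_hyperparameter(candidates, train_set, holdout_set)
```

Here the outer loop is an exhaustive scan over five candidates, which is only feasible because the inner fit is essentially free; for a DCN each inner fit is a full training run, which is what motivates a model-based search.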


2.2. Sequential model based optimization 

SMBO is a direct search method for non-linear optimization, in which one begins by selecting a meta-model of the function for which an extremum is sought. One then applies an active learning strategy to select the query point that provides the most potential to approach the extremum. More specifically, let us assume that we have a database $D_{1:t} = \{\lambda_{1:t}, e_{1:t}\}$ of $t$ DCN models, where $\lambda_i|_{i=1}^{t}$ represents the set of hyper-parameters and $e_i|_{i=1}^{t}$ is the validation error on the hold-out dataset generated by each of the $t$ DCN models. The basic idea underlying SMBO is to replace the original optimization problem of finding the extremum of a given function, such as $\Psi(\lambda)$, which is time consuming and computationally expensive, with an equivalent problem of optimizing the expected value of a utility function, $u(e)$. As we will see below, optimizing over the expected value of the utility function is computationally much cheaper and faster than solving the original problem. Under SMBO, one usually begins by assigning a prior distribution $p(e)$ on $e$. One then uses the database $D_{1:t}$ to obtain an estimate for the likelihood function $p(\lambda_{1:t}|e)$. The prior and the likelihood function are used to obtain an estimate for the posterior distribution $p(e|\lambda_{1:t}) \propto p(\lambda_{1:t}|e)\,p(e)$. The objective then is to choose $\lambda_{t+1}$, which maximizes the expected value of $u(e)$ under $p(e|\lambda_{1:t})$, i.e., $\lambda_{t+1} = \operatorname{argmax}_\lambda \left[E(u(e))\right]$, where $E(u(e))$ is given as:

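$$E(u(e)) = \int u(e)\, p(e|\lambda_{1:t})\, de = \int u(e)\, \frac{p(\lambda_{1:t}|e)\, p(e)}{p(\lambda_{1:t})}\, de \tag{2}$$

A common choice for the utility function $u(e)$ is the Expected Improvement (EI) function, $u(e) = \max((e^* - e), 0)$, in which case Eq. 2 becomes

$$E(u(e)) = \int_0^{e^*} (e^* - e)\, p(e|\lambda_{1:t})\, de \tag{3}$$

Then under SMBO we have

$$\lambda_{t+1} = \underset{\lambda}{\operatorname{argmax}} \left[ \frac{1}{p(\lambda_{1:t})} \int_0^{e^*} (e^* - e)\, p(\lambda_{1:t}|e)\, p(e)\, de \right] \tag{4}$$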
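To make the Expected Improvement integral in Eq. 3 concrete, it can be approximated by averaging $\max(e^* - e, 0)$ over samples of $e$ drawn from the posterior. The sketch below assumes such samples are already available as a plain list; the function name `expected_improvement` and the sample values are illustrative and not taken from the experiments.

```python
def expected_improvement(e_star, posterior_samples):
    """Monte Carlo estimate of E[max(e* - e, 0)].

    posterior_samples: errors e assumed to be drawn from p(e | lambda_{1:t}).
    Only samples below the threshold e* contribute, which mirrors the
    integration range [0, e*] in Eq. 3.
    """
    improvements = [max(e_star - e, 0.0) for e in posterior_samples]
    return sum(improvements) / len(improvements)

# Illustrative values only: four hypothetical posterior samples of the error.
samples = [0.08, 0.10, 0.12, 0.15]
ei = expected_improvement(0.11, samples)
```

With these four samples only the first two fall below $e^* = 0.11$, so the estimate is $(0.03 + 0.01)/4 = 0.01$.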


HyperOpt($D_{1:t}$, $p$, $N$):

While $t < N$ do:

1. Estimate $e^*$ s.t. $p(e_{1:t} < e^*) = 0.5$

2. Use $e^*$ and $\lambda_{1:t}$ to estimate $l(\lambda)$ and $g(\lambda)$

3. Evaluate $\lambda_{t+1}$ according to Eq. 5 and Eq. 6

4. Train the new DCN model with $\lambda_{t+1}$ to estimate $e_{t+1}$

5. $t \leftarrow t + 1$

Table 1. SMBO algorithm for estimating architecture hyper-parameters for DCN.


If $p(e < e^*) = \gamma$ (a constant), i.e., $e^*$ is chosen to be some quantile of the observed $e$ values, and we define two density functions, $l(\lambda) = p(\lambda_{1:t}|e)$ when $e < e^*$ and $g(\lambda) = p(\lambda_{1:t}|e)$ when $e \geq e^*$, as proposed in [?], then

$$\lambda_{t+1} = \underset{\lambda}{\operatorname{argmax}} \left[ \frac{e^* \gamma\, l(\lambda)}{\gamma\, l(\lambda) + (1 - \gamma)\, g(\lambda)} \right] \tag{5}$$

Bergstra et al. [?] proposed an adaptive Parzen estimator algorithm to evaluate Eq. 5 so as to maximize the ratio $l(\lambda)/g(\lambda)$. In this work, we propose a simple strategy to evaluate Eq. 5 as follows: let $l(\lambda) + g(\lambda) = U(\lambda) = k$ (a constant). In other words, $\lambda$ is drawn from a uniform distribution. Then, if $\gamma = 0.5$, we have from Eq. 5

$$\lambda_{t+1} = \underset{\lambda}{\operatorname{argmax}} \left[ \frac{e^*}{k}\, l(\lambda) \right] \tag{6}$$

Our proposed simplification in Eq. 6 does not require an estimate for both $l(\lambda)$ and $g(\lambda)$. We can simply pick $\lambda \in U(\lambda)$ and evaluate it according to the empirical distribution of $l(\lambda)$ generated from $D_{1:t}$ to obtain the next potential $\lambda_{t+1}$. Since the proposed algorithm chooses $\lambda \in U(\lambda)$ at every step, a much larger space of potential $\lambda$'s is explored, which may in turn slow down the convergence rate to an optimal $\lambda^*$. To counter this, we adopt a hybrid strategy by combining Eq. 5 and Eq. 6. With probability $p$, at every iteration, we choose the potential $\lambda_{t+1}$ according to Eq. 5, and with probability $1 - p$, we choose the potential $\lambda_{t+1}$ according to Eq. 6. In Table 1, we summarize the steps involved in our proposed algorithm.
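The loop of Table 1 with the hybrid choice between Eq. 5 and Eq. 6 can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the three-dimensional discrete search space `SPACE`, the synthetic `toy_error` that replaces DCN training, the smoothed frequency counts used as stand-ins for $l(\lambda)$ and $g(\lambda)$, and the choice to score a fixed number of uniformly drawn candidates in both branches. It is a sketch of the idea, not the implementation used for the experiments.

```python
import random
from statistics import median

# Hypothetical toy search space: each hyper-parameter takes a few discrete values.
SPACE = {"n_conv": [2, 4, 6, 8], "n_filters": [64, 128, 256], "filter_size": [3, 5, 7]}

def toy_error(lam):
    """Synthetic stand-in for 'train a DCN with lam and measure hold-out error'."""
    return (abs(lam["n_conv"] - 6) / 10 + abs(lam["n_filters"] - 256) / 1000
            + abs(lam["filter_size"] - 3) / 20)

def sample_uniform(rng):
    """Draw lam from the uniform distribution U(lambda) over SPACE."""
    return {name: rng.choice(values) for name, values in SPACE.items()}

def density(lam, group):
    """Product of Laplace-smoothed per-dimension frequencies of lam within group."""
    prob = 1.0
    for name, values in SPACE.items():
        count = sum(1 for member in group if member[name] == lam[name])
        prob *= (count + 1) / (len(group) + len(values))
    return prob

def propose(database, rng, p=0.9, gamma=0.5, n_candidates=20):
    """Steps 1-3 of Table 1: pick the next lam from uniformly drawn candidates."""
    e_star = median(e for _, e in database)             # step 1: gamma = 0.5 quantile
    good = [lam for lam, e in database if e < e_star]   # step 2: data behind l(lambda)
    bad = [lam for lam, e in database if e >= e_star]   #         data behind g(lambda)
    candidates = [sample_uniform(rng) for _ in range(n_candidates)]
    if rng.random() < p:
        # Eq. 5 (the constant factor e* does not change the argmax).
        def score(lam):
            l, g = density(lam, good), density(lam, bad)
            return gamma * l / (gamma * l + (1 - gamma) * g)
    else:
        # Eq. 6: only l(lambda) is needed.
        def score(lam):
            return density(lam, good)
    return max(candidates, key=score)

def smbo(n_init=8, n_total=40, seed=0):
    """Random initial database followed by model-guided proposals (steps 4-5)."""
    rng = random.Random(seed)
    database = []
    for _ in range(n_init):
        lam = sample_uniform(rng)
        database.append((lam, toy_error(lam)))
    while len(database) < n_total:
        lam = propose(database, rng)
        database.append((lam, toy_error(lam)))
    return database

database = smbo()
best_lam, best_error = min(database, key=lambda pair: pair[1])
```

The default `p=0.9` mirrors the value quoted for the experiments; `n_init`, `n_total` and `n_candidates` are arbitrary small numbers chosen so the toy run finishes instantly.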


3. RESULTS 

We evaluate our proposed SMBO algorithm on the CIFAR-10 benchmark, which consists of 60,000 32x32 color images. The collection is split into 50,000 training and 10,000 testing images. All the DCNs generated by our proposed algorithm were trained using cuda-convnet¹. We used a 3-step cooling procedure: starting with learning rate $l = 0.01$, momentum $m = 0.9$, and weight decay parameter $w_c = 0.0005$ for the first 120 epochs, followed by another 20 epochs with the learning rate reduced by a factor of 10 (keeping the other parameters the same), and then training for 10 more epochs with the learning rate further reduced by a factor of 10.
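The cooling procedure amounts to a piecewise-constant learning-rate schedule over 120 + 20 + 10 = 150 epochs. A minimal sketch follows, assuming 0-indexed epochs; the function name `learning_rate` and the constant names are hypothetical and are not cuda-convnet options.

```python
MOMENTUM = 0.9          # kept fixed across all three stages
WEIGHT_DECAY = 0.0005   # kept fixed across all three stages

def learning_rate(epoch, base_lr=0.01):
    """Learning rate for a 0-indexed epoch under the three-stage schedule."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < 120:        # stage 1: first 120 epochs at the base rate
        return base_lr
    if epoch < 140:        # stage 2: next 20 epochs at base_lr / 10
        return base_lr / 10
    if epoch < 150:        # stage 3: final 10 epochs at base_lr / 100
        return base_lr / 100
    raise ValueError("the schedule covers epochs 0..149 only")

schedule = [learning_rate(epoch) for epoch in range(150)]
```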

Since the primary focus for us in this work is to determine whether SMBO can be used to identify suitable DCN architectures, we fixed the DCN hyper-parameters associated with the back-propagation learning algorithm as described above. The set of DCN architecture hyper-parameters that we consider for optimization is listed in Table 2.


| Layer type | Hyper-parameters |
|---|---|
| Conv. Layer | 1. No. of Conv. layers; 2. No. of filters per layer; 3. Filter size; 4. Filter stride |
| Norm. Layer | 1. Size |
| Pool Layer | 1. Size; 2. Stride; 3. Type of pooling: max/avg. |
| Hidden Layer | 1. No. of hidden layers; 2. No. of nodes per hidden layer; 3. Dropout value |

Table 2. DCN architecture hyper-parameters. In addition to the parameters listed above, we also consider two additional boolean hyper-parameters to represent the presence or absence of the Norm layer or the Pool layer, and a third boolean hyper-parameter indicating the presence or absence of dropout. For the normalization layer, we only consider local response normalization across filter maps [?], with a scaling factor of 0.75.
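One way to picture the search space of Table 2 is as a small configuration record from which random initial architectures are drawn. The field names follow the table, but every value range below (filter counts, sizes, strides, layer limits, dropout values) is a hypothetical choice made for this sketch, since the ranges used in the experiments are not listed.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ConvBlock:
    n_filters: int
    filter_size: int
    stride: int
    norm_size: Optional[int]               # None: Norm layer absent
    pool: Optional[Tuple[int, int, str]]   # (size, stride, type) or None: Pool layer absent

@dataclass
class Architecture:
    conv_blocks: List[ConvBlock]
    hidden_nodes: List[int]                # one entry per hidden layer
    dropout: Optional[float]               # None: dropout absent

def sample_architecture(rng):
    """Draw one random architecture; all ranges are illustrative assumptions."""
    blocks = []
    for _ in range(rng.randint(1, 8)):     # no. of conv layers
        use_norm = rng.random() < 0.5      # boolean: Norm layer present?
        use_pool = rng.random() < 0.5      # boolean: Pool layer present?
        pool = None
        if use_pool:
            pool = (rng.choice([2, 3]), rng.choice([1, 2]), rng.choice(["max", "avg"]))
        blocks.append(ConvBlock(
            n_filters=rng.choice([64, 128, 256]),
            filter_size=rng.choice([3, 5, 7]),
            stride=rng.choice([1, 2]),
            norm_size=rng.choice([3, 5]) if use_norm else None,
            pool=pool,
        ))
    hidden = [rng.choice([512, 1024, 2048]) for _ in range(rng.randint(0, 2))]
    dropout = rng.choice([0.25, 0.5]) if rng.random() < 0.5 else None
    return Architecture(blocks, hidden, dropout)

arch = sample_architecture(random.Random(0))
```

Using `None` for an absent layer folds the three boolean hyper-parameters into the record itself, which keeps invalid combinations (for example, a pool size without a pool layer) unrepresentable.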


It has been reported in the literature [?] that very deep networks are difficult to train, primarily because they suffer from the vanishing gradient problem at larger depths. To alleviate this problem, in our implementation of SMBO for DCN architectural hyper-parameters, all the DCNs are generated to include a local logistic regression (LR) cost function layer at the output of one or more of the convolution blocks.

For the results presented here, we consider $t = 32$ as the size of our initial database, based on our analysis of random search hyper-parameter optimization [?], and we set $p = 0.9$.



Fig. 1. (a) The mean test error and standard deviation (in yellow) as a function of the SMBO iteration number for multi-view mode. (b) The minimum multi-view mode test error as a function of the SMBO iteration number.


¹ https://code.google.com/p/cuda-convnet2/

| Architecture | # Trainable parameters | Parameters | Conv. Layer 1 | Conv. Layer 2 | Conv. Layer 3 | Conv. Layer 4 | Conv. Layer 5 | Conv. Layer 6 | Conv. Layer 7 | Conv. Layer 8 | FC Layer 1 | FC Layer 2 | Softmax | Test error (550 epochs) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DCN 1 | ≈30.9 M | Filter / size | 64x3x3 | 256x3x3 | 256x3x3 | 256x3x3 | 256x3x3 | 256x3x3 | 256x3x3 | 256x11x11 | 3,314 | 4,951 | 10 | 7.81% |
| | | Stride | 1 | 1 | 2 | 2 | 1 | 2 | 2 | 10 | | | | |
| | | Padding | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 9 | | | | |
| DCN 2 | ≈4.0 M | Filter / size | 128x3x3 | 128x3x3 | 128x3x3 | 256x3x3 | 256x3x3 | 256x7x7 | NA | NA | NA | NA | 10 | 8.17% |
| | | Stride | 1 | 2 | 1 | 1 | 1 | 2 | | | | | | |
| | | Padding | 0 | 1 | 0 | 0 | 0 | 1 | | | | | | |
| DCN 3 | ≈3.4 M | Filter / size | 256x3x3 | 128x3x3 | 256x3x3 | 256x3x3 | 256x3x3 | 128x7x7 | NA | NA | NA | NA | 10 | 8.63% |
| | | Stride | 1 | 2 | 1 | 1 | 2 | 5 | | | | | | |
| | | Padding | 0 | 1 | 0 | 0 | 1 | 2 | | | | | | |

Pooling (Size, Stride), Normalization and Dropout rows (layer positions not recoverable): DCN 1 has one pooling entry of (2,2) and six marked entries across its Normalization and Dropout rows; DCN 2 has three marked Normalization entries; DCN 3 has one marked Normalization entry.

Table 3. Architecture for the top 3 DCNs generated by our proposed SMBO algorithm.
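The parameter count listed for DCN 3 can be cross-checked from its filter sizes, strides and paddings. The sketch below rests on several assumptions that are not stated in the table: a 3-channel 32x32 input, the usual floor-based output-size formula, no bias terms, no effect of the normalization layer on sizes or weights, and a softmax layer connected directly to the final feature map. The helper names are hypothetical.

```python
def conv_output_size(n, k, stride, pad):
    """Spatial output size of a convolution (floor convention, an assumption)."""
    return (n + 2 * pad - k) // stride + 1

def count_weights(layers, in_channels=3, in_size=32, n_classes=10):
    """Sum conv weights (biases ignored) plus the softmax weights.

    layers: list of (n_filters, filter_size, stride, padding) per conv layer.
    Returns (total weight count, final spatial size).
    """
    total, channels, size = 0, in_channels, in_size
    for n_filters, k, stride, pad in layers:
        total += n_filters * channels * k * k
        size = conv_output_size(size, k, stride, pad)
        channels = n_filters
    total += channels * size * size * n_classes   # softmax on the final feature map
    return total, size

# DCN 3 rows of Table 3: (filters, filter size, stride, padding) per conv layer.
DCN3 = [(256, 3, 1, 0), (128, 3, 2, 1), (256, 3, 1, 0),
        (256, 3, 1, 0), (256, 3, 2, 1), (128, 7, 5, 2)]
total_weights, final_size = count_weights(DCN3)
```

Under these assumptions the spatial size shrinks 32, 30, 15, 13, 11, 6, 1 and the count comes to 3,383,296 weights, in line with the ≈3.4 M listed for DCN 3.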


In Figure 1a, we plot the mean test error (std. error shown in yellow; evaluated in multi-view test mode [?]) $E(i) = \langle e_{i-10:i} \rangle$, and in Figure 1b, we plot the minimum test error $M(i) = \min(e_{0:i})$, each as a function of the iteration number $i$. We see that the average test error gradually decreases towards an optimal solution, and the best minimum found also decreases with increasing iterations. Furthermore, our proposed SMBO procedure generated a large number of “good” DCN architectures that produce a test error of < 11% even with only 150 training epochs (not reported). In comparison, the best hand-tuned DCN architecture, produced by [6], exhibits 11% test error in multi-view mode and requires a longer training time, on the order of 500 epochs.
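The two curves of Figure 1 are simple running statistics over the sequence of test errors. A minimal sketch is given below; it assumes the mean is taken over at most the last 10 errors up to and including iteration $i$ (the exact window boundaries are an interpretation), and the error values are made up for illustration.

```python
def running_mean(errors, window=10):
    """E(i): mean of the most recent `window` errors up to and including i."""
    means = []
    for i in range(len(errors)):
        recent = errors[max(0, i - window + 1): i + 1]
        means.append(sum(recent) / len(recent))
    return means

def running_min(errors):
    """M(i): smallest error seen in iterations 0..i."""
    best, out = float("inf"), []
    for e in errors:
        best = min(best, e)
        out.append(best)
    return out

# Illustrative values only, not results from the experiments.
errors = [0.30, 0.20, 0.25, 0.10]
mean_curve = running_mean(errors, window=2)
min_curve = running_min(errors)
```

By construction the minimum curve never increases, whereas the windowed mean can move in either direction as new architectures are evaluated.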

Since the state-of-the-art performance numbers for the CIFAR-10 benchmark dataset are usually reported in multi-view mode (with data augmentation [?]), we report a multi-view test error of 7.81% for the best DCN generated by our proposed hyper-parameter optimization strategy, which compares favorably to the current state-of-the-art result on CIFAR-10 of 7.97% [?]. In Table 3, we summarize the top 3 DCN architectures found by our proposed SMBO procedure that produced multi-view test error < 9% on the CIFAR-10 benchmark. In ??, we summarize the number of parameters in each of these 3 DCN models.

At the time of writing the camera-ready version of this manuscript, we came across a recent paper [?] that reported a multi-view test error of 7.25% on the CIFAR-10 benchmark using a hand-designed DCN network that is comprised of only convolution layers and has 1.3 M parameters. Yet another paper [?] reported the utility of using parametric-relu neurons as opposed to relu neurons. While none of the optimized DCN networks that we report in Table 3 generates better performance numbers than the latest state-of-the-art numbers reported in [?], we wanted to determine whether the use of parametric-relu neurons can boost the performance of the optimized DCN networks that we have identified through the hyper-parameter optimization approach. Accordingly, we retrained the smallest of the three DCN networks from Table 3 using a version of parametric-rectified non-linear neurons of type $y = a\,x\,(x < 0) + \sqrt{x}\,(x > 0)$, where $a$ is a learnable parameter, fixed per neuron layer in the DCN network. We were able to obtain a multi-view test error score of 6.9%, which, to the best of our knowledge, represents the state-of-the-art score for the CIFAR-10 benchmark. In Table 4, we summarize all the known best-in-class numbers for the CIFAR-10 benchmark.


CIFAR-10 classification error (with data augmentation)

| Method | Activation function type | Error % | Trainable params |
|---|---|---|---|
| Maxout [11] | maxout | 9.38 | >6 M |
| Dropconnect [12] | relu | 9.32 | - |
| dasNet [13] | maxout | 9.22 | >6 M |
| Network in Network [14] | relu | 8.81 | ≈1 M |
| Deeply Supervised [15] | relu | 7.97 | ≈1 M |
| All-CNN [?] | relu | 7.25 | ≈1.3 M |
| DCN-3 (Ours) | p-renu | 6.9 | ≈3.4 M |

Table 4. Comparison of state-of-the-art results for the CIFAR-10 benchmark; relu: rectified linear unit; p-renu: parametric-rectified non-linear unit.

4. CONCLUSION

In this paper, we have proposed a simple SMBO algorithm and a recipe for hyper-parameter optimization of DCN architectures. We have demonstrated that SMBO can be used to generate a large number of “good” DCN architectures, which may then form a backbone for further investigations. Our results suggest that SMBO can indeed be used to identify superior DCNs. In summary, our work in this paper, in addition to earlier works [?], broadens the scope of the models that can be realistically investigated, without researchers being restricted to manual evaluation of a few architectural parameters at any given time.